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Abstract. We present probabilistic arithmetic automata (PAAs), a gen- 
eral model to describe chains of operations whose operands depend on chance, 
along with two different algorithms to exactly calculate the distribution of 
the results obtained by such probabilistic calculations. PAAs provide a unify- 
ing framework to approach many problems arising in computational biology 
and elsewhere. Here, we present five different applications, namely (1) pat- 
tern matching statistics on random texts, including the computation of the 
distribution of occurrence counts, waiting time and clump size under HMM 
background models; (2) exact analysis of window-based pattern matching al- 
gorithms; (3) sensitivity of filtration seeds used to detect candidate sequence 
alignments; (4) length and mass statistics of peptide fragments resulting from 
enzymatic cleavage reactions; and (5) read length statistics of 454 sequenc- 
ing reads. The diversity of these applications indicates the flexibility and 
unifying character of the presented framework. 

While the construction of a PAA depends on the particular application, we 
single out a frequently applicable construction method for pattern statistics: 
We introduce deterministic arithmetic automata (DAAs) to model determin- 
istic calculations on sequences, and demonstrate how to construct a PAA 
from a given DAA and a finite-memory random text model. We show how 
to transform a finite automaton into a DAA and then into the corresponding 
PAA. 
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1 Introduction 



In many applications, processes can be modeled as chains of operations working on 
operands that are drawn probabilistically. As an example, let us consider a simple dice 
game. Suppose you have a bag containing three dice, a 6-faced, a 12-faced, and a 20- 
faced die. Now a die is drawn from the bag, rolled, and put back. This procedure is 
repeated n times. In the end one may, for example, be interested in the distribution 
of the maximum number observed. Many variants can be thought of, for instance, we 
might start with a value of and each die might be associated with an operation, e.g. 
the spots seen on the 6-faced die might be subtracted from the current value and the 
spots on the 12-faced and 20-faced dice might be added. In addition to the distribution 
of values after n rolls, we can ask for the distribution of the waiting time for reaching a 
value above a given threshold. 

The goal of this article is to establish a general formal framework, referred to as 
probabilistic arithmetic automata (PA As), to directly model such systems and answer 
the posed questions. We emphasize that we are not interested in simulation studies 
or approximations to these distributions, but in an exact computation up to machine 
accuracy. Further, we show that problems from diverse applications, especially from 
computational biology, can be conveniently solved with PAAs in a unified way, whereas 
they are so far treated heterogeneously in the literature. This article is a substantially 
revised and augmented version of several extended abstracts that introduced the PAA 
framework [UJ and outlined some of the applications presented here [291 [23j HH [50] . 

Let us give an overview of the application domains of PAAs considered in this paper. 
We begin with the field of pattern matching statistics. Biological sequence analysis is 
often concerned with the search for structure in long strings like DNA, RNA or amino 
acid sequences. Frequently, "search for structure" means to look for patterns that occur 
very often. An important point in this process is to sensibly define a notion of "very 
often". One option is to consult the statistical significance of an event: Suppose we 
have found a certain pattern k times in a given sequence. What is the probability of 
observing k or more matches just by chance? The answer to this question depends 
on the used null model, i.e. the notion of "by chance". It turns out that the PAA 
framework paves the way to using quite general null models; finite-memory text models 
as used in this article comprise i.i.d. models, Markovian models of arbitrary order and 
character-emitting hidden Markov models (HMMs). 

The PAA framework can also be applied to the exact analysis of algorithms. Tradi- 
tionally, best case, average case, and worst case behavior of algorithms are considered. 
In contrast, we construct PAAs to compute the whole exact distribution of costs of arbi- 
trary window-based pattern matching algorithms like Horspool's or Sunday's algorithm. 
For these algorithm, we present (perhaps surprising) exemplary results on short patterns 
and moderate text lengths. 

Another application arises when searching for a (biological) query sequence, such as a 
DNA or protein sequence, in a comprehensive database. The goal is to quickly retrieve 
all sufficiently similar sequences. Heuristic methods use so-called alignment seeds in 
order to first detect candidate sequences which are then investigated more carefully. To 
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evaluate the quality of such a seed, one computes its sensitivity or hitting probability, 
i.e. the fraction of all desired target sequences it hits. Similarly, we ask which fraction 
of non-related sequences is hit by a seed by chance, as this quantity directly translates 
into unnecessary subsequent alignment work. The stochasticity arises from directly 
modelling alignments of similar (or non-similar) sequences rather than modelling the 
sequences themselves. We use the PAA framework to compute the match distribution 
and, in particular, the sensitivity of certain filtration seeds under finite-memory null 
models. 

Next, we investigate protein identification by mass spectrometric analysis. If one 
can describe the typical length and mass of peptide fragments measured by the mass 
spectrometer, one can define a reliable comparison of so-called peptide mass fingerprints, 
based on an underlying null model. In this article, we compute the joint length-mass 
distribution of random peptide fragments. Furthermore, we calculate the occurrence 
probability of fragments in a given mass range, i.e., the probability that cleaving a 
random protein of a given length yields at least one fragment within this mass range. 
This probability aids the interpretation of mass spectra as it gives the significance of a 
measured peak. Again, our framework permits to use arbitrary finite-memory protein 
models. 

The final application concerns DNA sequencing: The task of determining a DNA 
sequence has seen great technological progress over the last decades. For a particular 
sequencing technology ("454 sequencing"), the length of a sequenced DNA fragment 
depends on the order of its characters. We compute the exact length distribution of 
such fragments. With this distribution we can specify the technology settings that yield 
the longest reads on average if statistical properties of the genome under consideration 
are known, improving the sequencing performance by up to 10%. 

Most of the mentioned applications have, in some form or another, previously been 
discussed in the literature, but never been identified as instances of the same abstract 
scheme. The contribution of this article is twofold. On the one hand, we introduce a 
generic framework unifying the view on the presented applications. On the other hand, 
we show that, in all considered applications, not only known results can be reproduced 
using PAAs, but new achievements are made. Since the range of applications is quite 
diverse, we give further references to relevant literature in each section separately. 

Organization of the Article. In the first part of this article we set up the PAA frame- 
work. Specifically, we formally introduce PAAs in Section [2] and give generic algorithms 
in Section [3j We consider waiting time problems on PAAs in Section HI In Section [5J 
deterministic arithmetic automata and finite-memory text models are defined and shown 
to be a convenient means of specifying a PAA. 

Having the framework in place, we explore several application domains. Section [6] 
deals with the statistics of patterns on random texts. In Section PAAs are employed 
for the analysis of window-based pattern matching algorithms. We cover the field of 
alignment seed statistics in Section [SJ Then, in Section HJ we apply PAAs to mass 
statistics of fragments resulting from enzymatic digestion. The optimization of read 
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lengths in 454 sequencing is discussed in Section [TUJ We conclude the article with a 
summarizing discussion in Section [TT1 

The relations between this article and preliminary versions published in conference 
proceedings are as follows. Sections El El and parts of Section are based on [37]. 
Section |5] is based on [50] . Section 16.41 contains material from [38] . Sections UJ El and M 
are based on 50] . [23], and [29], respectively. 

An extended version of [50] has been submitted for consideration in a special issue 
of a journal, a preprint is available as [39]. There, the analysis of pattern matching 
algorithms is further generalized and applied to more algorithms. Here, we include a 
basic example in Section [JJ to give another example of the utility of PAAs. 

Notation and Conventions. The natural numbers (without zero) are denoted DM; to 
include zero, we write DM . Throughout the article, E is a finite alphabet; as usual, S* 
denotes the set of all finite strings. Indices are zero-based, i.e. s = s[0] . . . s[\s\ — 1] 
for s G £*. Substrings, prefixes, and suffixes are written s[i . . . j] := s[i] . . . s[j], s[..i] : = 
s[0] . . . s[i], and s[i..] :— s[i] . . . s[\s\ — 1), respectively. All stochastic processes considered 
in this article are discrete. Therefore, appropriate probability spaces can always be 
constructed; we do not clutter notation by stating them explicitly. By P, we refer to a 
probability measure; C(X) denotes the distribution of the random variable X. Iverson 
brackets are written [•], i.e. {AJ = 1 if the statement A is true and {A} = otherwise. 

2 Probabilistic Arithmetic Automata 

In this section, we define probabilistic arithmetic automata (PAAs) in order to formalize 
chains of operations with probabilistic operands. PAAs can be interpreted as generalized 
Markov Additive Processes (MAP) [18, 19J in the discrete case. 

Definition 2.1 (Probabilistic Arithmetic Automaton). A probabilistic arithmetic au- 
tomaton V is a tuple 

V = (Q, g , T, V, v , S,fi= (ji q )qeQ, = (9 q ) qeQ ) , 

where 

• Q is a finite set of states, 

• <?o € Q is called start state, 

• T : Q x Q — > [0, 1] is a transition function with J2 q >eQ l') — 1 f or a M Q E Q, 
i.e. (T(q, q')) q , eQ is a stochastic matrix, 

• V is a set called value set, 

• v o G V is called start value, 

• £ is a finite set called emission set, 
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Figure 1: Illustration of a PAA for the dice example. Each of the three dice is represented 
by a state (circles). Emission distributions are depicted as gray boxes. Each 
arrow stands for a possible state transition, each transition having a probability 
of 1/3. The state's operations are given in triangles next to the state. For the 
start state, emission distribution and operation are not drawn. 

• each fi q : £ — >■ [0, 1] is an emission distribution associated with state q, 

• each 8 q : V x £ — >■ V is an operation associated with state q. 

We attach the following semantics: At first, the automaton is in its start state qo, 
as for a classical deterministic finite automaton (DFA). In a DFA, the transitions are 
triggered by input symbols. In a PAA, the transitions are purely probabilistic; T(q, q') 
gives the chance of going from state q to state q'. Note that the tuple (Q, T, S qo ) defines 
a Markov chain on state set Q with transition matrix T, where the initial distribution 
8 qo is the Dirac distribution assigning probability 1 to {qo}- 

While going from state to state, a PAA performs a chain of calculations on a set of 
values V. It starts with value v . Whenever a state transition is made, the entered 
state, say state q, generates an emission from £ according to the distribution ji q . The 
current value and this emission are then subject to the operation 6 q , resulting in the 
next value from the value set V. Notice that the Markov chain (Q, T, 8 qo ), together with 
the emission set £ and the distributions fi = (fi q ) q( zQ, defines a hidden Markov model 
(HMM). In the context of HMMs, however, the focus usually rests on the sequence of 
emissions, whereas we are interested in the value resulting from a chain of operations on 
these emissions. 

By introducing PAAs, we emphasize that many applications can naturally be modelled 
as a chain of operations whose result is of interest. When compared to Markov chains, 
PAAs do not offer an increase in expressive power. In fact, from a theoretical point of 
view, every PAA might be seen as a Markov chain on the state space Q x V. Thus, we 
advocate PAAs not because of their expressive power but for their merits as a modelling 
technique. As we shall see, the framework lends itself to many applications and often 
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allows simple and intuitive problem formulations. Before we formalize the introduced 
semantics in Definition I2.3[ let us come back to the dice example from Section [TJ 

Example 2.2 (Dice). We model each of the three dice as a PAA state. All transition 
probabilities equal 1/3 and the emissions have uniform distributions over the number of 
faces of the respective dice. If we are interested in the maximum, each state 's operation 
is the maximum. In general, we can associate individual operations with each state, for 
instance: "sum the numbers from the 12- and 20-faced dice and subtract the numbers 
seen on the 6-faced die". The value set is 7L, the start value is naturally vq = 0. The 
corresponding PAA is illustrated in Figure Ui 

Definition 2.3 (Stochastic processes induced by a PAA). For a given PAA V = 
(<2, go, T, V, vo, £, /i, 0), we denote its state process by (Qf)teN - It is defined to be a 
Markov chain with Q$ = q and 

p {Qf+i = Qt+i | Qt = Qt, ■ ■ ■ , Qo = Qo) 
= P (Q? +1 = qt+i | Q? = qt) = T(q t , q t+1 ) 
for all q , . . . q t+ i G Q. Further, we define the emission process (Ef) t ^ through 
P(E? = e\QV = q ,...,Q? = q t ,E£ = e ,...,El 1 = e t _ 1 ) 
= P(E? = e\Q? = q)= N (e), 

i.e. the current emission depends solely on the current state. We use (Qf)t<=N an d 
{Ef) t ^ to define the process of values {V^ p ) t en resulting from the performed operations: 

V V = Vo and Vf = 6 Q v (V£ v Ef) . (3) 

If the considered PAA is clear from the context, we omit the superscript V and write 
(Qt)teMo; (V t )tm , and (E t ) tmo , respectively. 



3 Computing the State- Value Distributions of PAAs 

We describe two algorithms to compute the distribution of resulting values. In other 
words, we seek to calculate the distribution C(V n ) of the random variable V n for a given 
n. The idea is to compute the joint distribution C(Q n , V n ) and then to derive the sought 
distribution by marginalization: 

P(K = V) = P(Qn = q, V n = V) . (4) 

qeQ 

For the sake of a shorter notation, we define ft(q,v) := P(Qt = q, V t = v ) for t G i\lo, 
qE Q,v G V. 

A slight complication arises when V is infinite. However, for each t, the range of V t is 
finite, as it is a function of the states and emissions up to time t, and these are finite sets. 
We define Vt '■— range V t and $ n := maxcKt^n \Vt\. Clearly < (\Q\ ■ \£\) n . Therefore 
all actual computations are on finite sets. As we will see, in many applications, d n grows 
only polynomially (even linearly) with n. In the following, we shall understand V as the 
appropriate union of Vt sets. Running times of algorithms are given in terms of d n . 
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3.1 Basic Algorithm 

We discuss an algorithm to compute the distribution f n = C(Q n , V n ). A basic recurrence 
relation follows from Definitions 12.11 and 12.31 

Lemma 3.1 (State-value recurrence). For a given PAA, the state-value distribution can 
be computed by 

\lifq = q andv = v , 

fo(q,v) = < . (5) 

I (J otherwise , 

and 

/t+i(?^) = E E ft{q',v')-T{q\q)- ii q (e), (6) 
where 9~ x {v) denotes the inverse image set of v under 9 q . 

Proof. Equation (J5]) follows directly from ([[]) and fl3]). Let us verify Equation ([6]): 

ft+i(q, v) = P (Qt+i = q, V t+ i = v) 

= EEE P (Gh-i = V ^ = v 'Qt = <f, Vt = v', E t+1 = e) 
q'eQv'eV eee 

= EEE P @t+i = V M = u ' Et +i = e \Qt = q',Vt = v') ■ f t (q', v') 
q'eQv'eV ee£ 

= E E ElW e ) = ^1 • p (Qw = 9, = e | Q t = V* = «') • </) 

= E E f (Qt+i = q, E t+1 = e\Q t = g', V t = v'\ -M, v') ■ 
9'eQ(^, e ) 6 0-i(„) V ^ ' 

We further evaluate the expression (*): 

(*) =P (Q t+1 = q, E t+1 = e\Q t = q',V t = v') 

= P (E t+1 = e\Q t = q\ Q t+1 = q,V t = v') ■ P (Q t+1 = q\Q t = q',V t = v') , 



= fJ.q(e) =T(q',q) 

where (i) is true because of (J2J) and ([3]) and (zz) follows from the fact that (Qt)teN is a 
Markov chain. □ 

We start with the distribution f and calculate the subsequent distributions by apply- 
ing Equation fl6]) until we obtain the desired f n . A straightforward implementation of 
Equation (jHJ) results in a pull- strategy] that means each entry in the table representing 
ft+i is calculated by "pulling over" the required probabilities from table f t . Note that 
this approach makes it necessary to calculate 9~ x in a preprocessing step. In order to 
avoid this, we may implement a push- strategy, meaning that we iterate over all entries 
in f t rather than / f+1 and "push" the encountered summands over to the appropriate 
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Algorithm 1 PaaDist 



Input: /o = C(Qo, V ), n E M , space for two tables of size \Q\ x 
Output: f n = £(Q n , V n ) 
1: for t = 1 to n do 

2: initialize ft(q,v) = for all g G <2, t> G Vt 
3: for all g G Q and v G Vf_i do 
4: for all q' E Q and e E £ do 

5: v'^9 q/ (v,e) 

6: /*(?>') <- /t(g',u') + /t-i(?>«) •^(9,9') -^(e) 

7: end for 

8: end for 

9: end for 

10: return /„ 



places in table ft+i] in effect, we just change the order of summation. Algorithm [T] shows 
the push-strategy in detail. 

In the course of the computation, we have to store two distributions, ft and f t +i, at 
a time. Once f t +i is calculated, ft can be discarded. Since the table at time t has a size 
of \Q\ x |V t |, the total space consumption is ■ t? n )- Computing f t from f t -\ takes 

• \Vt\ + \Q\ 2 ' |Vt-i| ■ |«^|) time, as can be seen from Algorithm [TJ We arrive at the 
following lemma. 

Lemma 3.2. Given a PAA (Q,q ,T,V,v ,S, fi,9), the distribution of values C(V n ) can 
be computed in 0(n ■ \ Q\ 2 ■ d n ■ \£\) time and 0(\Q\ ■ d n ) space. 

3.2 Doubling Technique 

If n is large, executing the above algorithm may be slow. In this section, we present an 
alternative algorithm that may be favorable for large n. To derive this algorithm, we 
consider the conditional probability 

U®(qi,q2,vi,v 2 ) ■■= P(Qt +t = q2,V to+t = v 2 \ Qt = qi,V to = vt) . (7) 

Note that does not depend on to, because transition as well as emission probabilities 
do not change over "time" (a property called homogeneity). Once is known, we can 
simply read off the desired distribution C(Q n , V n ): 

P(Qn = q, Vn = v) = U {n) (g , q, v Q , v) . (8) 
The following lemma shows how can be computed. 

Lemma 3.3. Let (Q,q ,T,V,vo,£,fi,9) be a PAA and (Qt)teN an d (Vt)teN its state 
and value process, respectively. Then, 

U {1) (qi,q 2 ,v 1 ,v 2 ) = T(q 1 ,q 2 ) ■ ^ e "> ( 9 ) 

8q 2 (vi,e)=v 2 
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and, for all t\ G [No and t% G DMo, 

f/ (tl+t2) (gi, g 2 , v u v2) = E f/( * l) ^ 1 ' y i' u ') • f/(t2) ^'' <? 2 ' ^ U2 ) ■ ( 10 ) 

q'eQv'eV 

Using these recurrences, the distribution of values C(V n ) can be computed in 0(\ogn ■ 
\Q\ 3 ■ tffj time and 0(\Q\ 2 ■ i? 2 ) space. 

Proof. Equation (j5J) follows from Definition I2.3[ while Equation ({TO]) follows from the 
Chapman-Kolmogorov Equation for homogeneous Markov chains when the PAA is seen 
as a Markov chain with state space Q x V. Computing f/(* 1+ * 2 ) from f/^ 1 ^ and takes 
Q| 3 ■ -i? 3 ) time, as follows from Equation ffTOj) . On the other hand, one step suffices to 
obtain U {2t) from U {t) . Thus, we can compute all U^- 2 ) for < b < [log(n)] in [log(n)] 
steps, which in turn can be combined into in at most [log(n)] steps. □ 

We note that the doubling technique is asymptotically faster if $ n is o{^Jn/ logn) and 
Q and £ are fixed. 

4 Waiting Times 

Besides calculating the distribution of values after a fixed number of steps, we can ask 
for the distribution of the number of steps needed to reach a certain value or a certain 
state. Such waiting time problems play an important role in many applications. We 
discuss examples in sections M and [TUJ A classical treatment of waiting time problems 
is given in [21]. Applications to occurrence problems in texts are reviewed in [59] . 

Definition 4.1 (Waiting time for a value). The waiting time for a set of target values 
T C V is a random variable defined as Wqr '■= min{t G DMo | Vt G T} if this set is not 
empty, and defined as infinity otherwise. 

While P (Wj- > t) may be nonzero for all t G DM, we are frequently only interested in 
the distribution up to a fixed time n. Then, of course, the exact values of C(Wj-)(t) = 
P(Wj- = t) are unknown for t > n, but their total probability P(Wy > n) is known, and 
n is typically chosen such that this total probability remains below a desired threshold. 

Lemma 4.2. Let (Q, q , T, V, v , £, /i, 6) be a PAA and T C V. Then, the probabilities 
£(Wr)(0), . . . ,C{Wr)(n) can be computed in 0(n ■ \Q\ 2 ■ \£\ ■ (d n - |T|)) time and 
0(\ Q\ ■ (^n — space. Alternatively, this can be done using O (logn- |Q| 3 •("#„ — |T|) 3 ) 
time and 0(\Q\ 2 ■ {d n - \T\) 2 ) space. 

Proof. We construct a modified PAA by defining a new value set V := (V \ T) U {•, o}, 
assuming (without loss of generality) that •, o ^ V, and new operations 

{6 q (v, e) if v ^ {•, o} and 6 q (v, e) ^ T, 
• if v £ {«,o} and 9 q (v,e) G T, 

o if v G {•, o} 
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for all q £ Q. Let V{ be the modified value process. Using the modified PAA, the 
probability of waiting time t can be expressed as 

p(w T =t) = p(v: = .). 

Runtime and space bounds follow from Lemmas 13.21 and 13.31 □ 

Besides waiting for a set of values, we may also wait for a set of states. Since a 
PAA's state process does not depend on emission and value processes, the remainder 
of this section solely concerns the Markov chain (Q,T, 8 qo ), which is part of the PAA 
(<2, go, T, V, t>o, £, /i, 0). Waiting times in Markov chains are a well-studied topic with 
particular interest in pattern occurrences [SU] and queuing theory [13] . For complete- 
ness and implementation purposes within the PAA framework, we briefly restate the 
construction here. 

Definition 4.3 (Waiting time for a state). The waiting time for a set of target states 
S C Q is a random variable defined as Ws '■= min{t G DMo | Qt G S} if this set is not 
empty and defined as infinity otherwise. 

Lemma 4.4. Let (Q,q ,T,V,v ,£,fi,9) be a PAA, a : Q — > [0, 1] be a probability 
distribution on Q, and S C Q be a set of target states. Consider the Markov chain 
(Q,T,a), let (Q' t )teN be its state process, and let W' s := min{t G Hq\Q[ E S} be the 
waiting time for states S. Then C{W' S ){Q), . . . , C{W' s ){n) can be computed in 0{n ■ \ Q\ 2 ) 
time and 0(|<2|) space, or in 0{\ogn ■ \ Q\ 3 ) time and 0(\Q\ 2 ) space using the doubling 
technique. If a = 6 qo , then Ws = W s . 

Proof. As in the proof of Lemma I4.2[ we introduce an aggregation state • to replace S 
and an absorbing state o to "flush" •. Then P(W S = t) = P(Q' t = •). □ 

When a = d qo , the above lemma yields the waiting time for the first event of reaching 
one of the states in S. We further consider the waiting time of a return event 

W s ° := min{t G INI | Q to+t G <S}. 

If the Markov chain is aperiodic and irreducible, it has a unique stationary state distri- 
bution, against which the PAA state distribution converges exponentially fast. We can 
then use Lemma 14.41 to compute 

lim P(Wt = t' I Q t G S) for each t' G DM 
by choosing a in Lemma 14.41 as the stationary distribution restricted to S. 

5 PAAs Based on Random Sequences 

We discuss the construction of PAAs modelling the deterministic processing of random 
sequences. That means we assume to be given a mechanism that processes sequences 
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character by character and deterministically computes a value for a given string. A 
pattern matching algorithm that computes the number of matches in a given sequence 
might serve as an example. In this section, we ask for the distribution of resulting 
values when a deterministic computation is applied to random strings. Therefore we 
first define text models and deterministic arithmetic automata to represent random texts 
and deterministic computations, respectively, and then combine both into a PAA. 

5.1 Random Text Models 

Given an alphabet E, a random text is a stochastic process (S t )teN , where each St takes 
values in E. A text model P is a probability measure assigning probabilities to (sets of) 
strings. It is given by (consistently) specifying the probabilities P(Sq . . . S\ s \-i = s) for 
all s G E*. We only consider finite-memory models in this article which are formalized 
in the following definition. 

Definition 5.1 (Finite-memory text model). A finite-memory text model is a tuple 
(C, c , S, <f), where C is a finite state space (called context space,), c G C a start context, 
E an alphabet, and ip : C x E x C — > [0, 1] with Xlo-eE c'ec V 9 ! ' °"> c ') = 1 f or °^ c ^ C- The 
random variable giving the text model state after t steps is denoted C t with Co := Cq. A 
probability measure is now induced by stipulating 

n-l 

P(S . . . 5 n _i = s, Ci = ci, . . . , C n = c n ) := Yl vfa, s[i], Cj+i) 

i=0 

for all n G Mo, s G E n ; and (ci, . . . , c n ) G C n . 

The idea is that the model given by (C, cq, E, if) generates a random text by moving 
from context to context and emitting a character at each transition, where ip(c, a, d) 
is the probability of moving from context c to context d and thereby generating the 
letter a. 

Note that the probability P(So • • • S\ s \-\ = s) is obtained by marginalization over all 
context sequences that generate s. This can be efficiently done, using the decomposition 
of the following lemma. 

Lemma 5.2. Let (C, Co, S, ip) be a finite-memory text model. Then, 

P(S ...S n = sa, C n+1 =c) = J2 P ( S o ■ ■ ■ S n -i = s,C n = d) ■ ip(d, a, c) 

dec 

for all n G M 0; s G E n 7 o G E and ceC. 
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Proof. We have 



P(S ...S n = sa, C n+ i = c) 

-- 22 . . . S n = sa,Ci = Ci, . . . ,C n = c n , C n+1 = c) 



Cl,...,Cn 

n— 1 



= Y[(p(ci,s\i},c i+1 ) ■ ip(c n} a } c) 

c 1 ,...,c„ i=0 

= Y [ II ^( C i> S [^Q+l) I • V(Cn,Or,c) 

c n €C \ci,...,c„_i 1=0 / 

= P(5 • • • S n -i = s,C n = c n ) ■ y?(c n , a, c) . 
c n ec 

Renaming c n to d yields the claimed result. □ 

Similar text models are used in [56] . where they a called probability transducers. In 
the following, we refer to a finite-memory text model (C, cq, S, <p) simply as text model, 
as all text models considered in this article are special cases of Definition 15.11 

For an i.i.d. model, we set C = {e} and cp(e, a, e) = p a for each a G S, where p a is the 
occurrence probability of letter a (and e may be interpreted as an empty context). For 
a Markovian text model of order r, the distribution of the next character depends only 
on the r preceding characters (fewer at the beginning); thus we set C := U[=o^*- ^ ne 
conditional follow-up probabilities are given by 

P(Si = s\i] | S i - 1 = s[i-l],...,S = s[0\) 

(p(s[..i — 1], s[i], s[..i\) if % < r , 

(p{s[i — r . . . i — 1], s[i], s[i — r + 1 . . . i]) if i > r . 

This notion of text models also covers variable order Markov chains as introduced in [62J, 
which can be converted into equivalent models of fixed order. Text models as defined 
above have the same expressive power as character-emitting HMMs, that means, they 
allow to construct the same probability distributions. For a given HMM, we can con- 
struct an equivalent text model by using the same state space (contexts) and setting 
<f(c,a,d) := T(c,d) ■ /v(o"), where T and /v are the HMM's transition function and 
emission distribution attached to state d, respectively. When, on the other hand, a text 
model (C, Co, E, <p) is given, we construct an equivalent HMM by using C 2 as state space 
and setting 

Eaes^( c 2,^4) if c 2 = c' 1 , 
otherwise, 



r((ci,c 2 ), (4,4)) := 
and 
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5.2 Deterministic Arithmetic Automata (DAAs) 

In order to model deterministic calculations on sequences, we define a deterministic 
counter-part to PAAs. 

Definition 5.3 (Deterministic Arithmetic Automaton, DAA). A deterministic arith- 
metic automaton is a tuple 

V = (Q,q ,E,5,V,v ,£, (Vq)qeQ, (8q)qeQ), 

where Q is a finite set of states, qo E Q is the start state, E is a finite alphabet, 5 : 
Q x £ — > Q is called transition function, V is a set of values, vq G V is called the 
start value, S is a finite set of emissions, r\ q G £ is the emission associated to state q, 
and 9 q : V x £ — >■ V is a binary operation associated to state q. Further, we define the 
associated joint transition function 

5 : (Q x V) x E (Q x V), S((q,v),a) := (S(q,a) , %1<T )M%,«r)))- 

VKe extend the definitions of 5 and 5 inductively to S* m i/iezr second argument by setting 

S(q,e) := q for the empty string e, 

S(q, xa) := S(S(q, x), a) for all i£E' and c G S, 

:=5(5((g,u),a;),<T). 

WTien 5((go, ^o) ? s) = (9)^) /or some g 6 Q and s G S* ; we say ina£ V computes 
value v for input s and define valuers) := v. 

Informally, a DAA starts with the state-value pair (qo,vo) and reads a sequence of 
symbols from E. Being in state q with value v, upon reading o G S, the DAA performs 
a state transition to g' := <5(g, cr) and updates the value to f' := O q '(v,r) q ') using the 
operation and emission of the new state q' . 

For each state q, the emission r\ q is fixed and could be dropped from the definition of 
DAAs. In fact, one could also dispense with values and operations entirely and define a 
DFA over state space Q x V, performing the same operations as a DAA. However, we 
intentionally include values, operations, and emissions to emphasize the connection to 
PAAs. 

5.3 Constructing PAAs from DAAs and Text Models 

We now formally state how to convert a DAA into a (restricted) PAA, where each 
emission distribution is deterministic (assigning probability 1 to a particular value), 
given a text model. 

Lemma 5.4 (DAA + Text model — > PAA). Let (C,co, £,</?) be a text model and V = 
{Q V , 00,^,8, V,v ,£, (n g )g eQ c, (Of)qeQp) be a DAA. Then, define 
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• a state space Q := Q x C 



• a start state go : = (Qq, cq) 



• transition probabilities 



T({q,c),(q',c' 



))■■ 



(11) 



ueS: 8{q,o)=q' 



• (deterministic) emission probability vectors 



/%c)(e) : 



1 ife = Vq, 
otherwise , 



/or a// (g, c) G Q. 

• operations 6( q<c ){v,e) := 9®(v,e) for all (g, c) G Q. 

T/ien, P = (Q, g , T, V, v , £,/x= (n g )qeQ, $ = (#q)<?eg) is a PAA with 

C(V t ) = C(value v (S ■ ■ ■ St-i)) 

/or a// i G DMo, where S is a random text according to the text model (C, cq, E, <^). 

Proof. V is a PAA by Definition 12.11 As in Section [3j we define ft{q,v) := P(Qt = 
g, 14 = u). To prove C(V t ) = C(valuex>(So . . . S t -i)), we show that 



ft((q V ,c),v) = P((??^o),s) = (q v ,v)] ■ P(S ...S t ^ = s,C t = c) (12) 



for all q° G Q v , c G C, v G V, and t G DM . For t = 0, Equation (Q2]) is correct 
by definitions of PAAs, DAAs and text models. For it > we prove it inductively. 
Assume f )12p to be correct for all t' with < t' < t. 




c),v) 



(13) 



= :<? 






(14) 




(15) 



q'eQ (v',e)eVx£ 



E E E P?.(« / ,e) = «]-[^ = e]-/*-i(g',iO 



q rD & QV c 'eC (v',e)£Vx£ 



(16) 



•E H^^) = fl^(c,a,c) 
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= E E E E E («'.<=) = »]• 1^ = 4 

ses'- 1 o-es q'^es 1 ' CeC (t;',e)evx£ 

■p(g rt, ,a) = gT.p((gg , I « b ) ia ) = ( g «' I TO] 
•P(S'o...5 t _ 2 = S! C t - 1 = c / )-^(c , ,a,c) 

= E E E m^e) = v]-lr kD = e}-[5(^,v ) 1 s) 

scrgS* q lT >eQ T) (v',e)eVxS 

■ W v , a) = q v \ ■ P(S . . . St^ = sa, C t = c) 

= E m(qo^o),sa) = (q v ,v)]-P(S ...S t ^ = S( r,C t = c) 

In the above derivation, step (TT3|) — >-( lT4"|) follows from Step (IT4")) — »-( lT5]) follows 

from the definitions of 6 q and \i q . Step (1X5"]) — »- (IT6"l) uses the definitions of T and Q 
in Lemma 15.41 Step ffT6]) — > f TTTj) uses the induction assumption. Step (TT7|) — »-( JT5]) uses 
Lemma 15.21 The final step ( jl8jl — >- ()19p follows by combining the four Iverson brackets 
summed over q m and (V, e) into a single Iverson bracket. □ 

Remark 5.5. In the above lemma, states having zero probability of being reached from 
go may be omitted from Q and T. 

Lemma 5.6 (PAA from DAA; Construction time and space). 

1. For a PAA constructed according to Lemma \5.4\ the value distribution C(V n ), or 
the joint state-value distribution, can be computed with 0(n ■ \ Q V \ ■ |E| ■ \C\ 2 ■ i? n ) 
operations using 0(\Q V \ ■ \C\ ■ i? n ) space. The same statement holds for computing 
the waiting time distribution up to time n. 

2. If for all c G C and <j6E, there exists at most one d G C such that tp(c, a, d) > 0, 
then the time is bounded by 0(n ■ \ Q V \ ■ |E| ■ \C\ ■ i9 n ). 

3. Using the doubling technique, the distributions can be computed in 0(logn- \Q V \ 3 ■ 
|C| 3 ■ time and 0(\Q V \ 2 ■ \C\ 2 ■ ti*) space. 

Proof. 

1. From Lemma [3 .2\ we obtain bounds for time and space complexity of 0(n ■ \ Q\ 2 ■ 
■& n - \S\) and -fin), respectively. By construction, \Q\ < \Q!?\-\C\. Recall that 

Lemma 13.21 is based on Algorithm [TJ The loops in lines [T] and [3] together account 
for a factor of 0{n- \ Q\-'d n ) in the time complexity. A factor of 0(\ Q\ ■ \£\) is caused 
by the inner loop in line HI However, since the constructed PAA has deterministic 
(i.e. Dirac distributed) emissions, we do not need to iterate over all e G S and save 
a factor of \S\. Furthermore, we only need to iterate over all states reachable in 
one step. For each q G Q, there exist at most |E| • \C\ such states by construction 
of the PAA. Therefore, the inner loop in lineH]can be modified to take C?(|E| • \C\) 
time, yielding the claimed runtime bound. 



(17) 



(q ,v 



(18) 
(19) 
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2. If for all c £ C and a £ E, there exists at most one c' £ C such that y?(c, a, c') > 0, 
then at most |S| different states are reachable from each state q £ Q. The claimed 
runtime follows by the same arguments as above. 

3. Alternative time and space complexities for the doubling algorithm follow directly 
from Lemma [3.31 

□ 

This concludes the derivation of the PAA framework and construction methods. The 
following sections are devoted to several applications. 

6 Pattern Matching Statistics 

One application of the introduced framework are pattern matching statistics. An al- 
gorithm that searches for a pattern p maps strings to the number of matches. That 
means, it deterministically processes a string and, by doing so, computes a value. In 
this section, we ask for the distribution of the number of occurrences of a given pattern 
in a random text. 

In computational biology, one searches for patterns that occur often and hypothesizes 
that these patterns carry biological meaning. Mere abundance, however, does not neces- 
sarily imply that the found pattern is meaningful. Short patterns like AC and degenerate 
patterns like ANNNNNNC will naturally occur quite often in a given stretch of DNA (here 
N is a wildcard character meaning aNy nucleotide). A better approach to quantify a 
pattern's overrepresentation is to consult the statistical significance: Suppose we have 
found a certain pattern k times in a given sequence. What is the probability of observing 
k or more matches just by chance? Precisely this question can be answered by comput- 
ing the distribution of occurrence counts. Given a suitable null model, a procedure to 
compute the significance of a pattern is a powerful tool in the context of motif discovery, 
as it allows the comparison of different patterns regardless of their structure and length. 

There are many different types of patterns that are relevant in computational biology, 
such as single strings, sets of strings, Prosite patterns^, consensus strings together with 
a distance measure and a distance threshold, abelian patterns, position weight matrices 
in connection with a threshold, etc. All these pattern types may be seen as ways to 
concisely describe a finite sets of strings. Thus, all these patterns may be expressed in 
the form of deterministic finite automata (DFAs) that recognize the respective string 
set. As our method is based on this DFA representation, it is very general and flexible 
regarding the pattern type. 

Besides specifying a pattern, one has to decide how overlaps are to be handled. We 
refer to the used strategy as counting scheme. The easiest case is to disallow overlaps at 
all; we call this scheme non-overlapping count. Consequently, we define the overlapping 
count to be the number of substrings that match the given pattern. In the case of a 

1 used in the Prosite database (see Hulo et al. [25]). A syntax description can be found under 
http : / / www . expasy . org/tools/ scanprosite/scanprosite-doc . html. 
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set of strings without any restrictions, this scheme makes counting more complicated 
as some words may be substrings of others. Many authors avoid the problem — at least 
partly — by simply counting the positions where at least one pattern ends, which we refer 
to as match position count. 

6.1 Related Work and New Results 

The topic of statistics of words on random texts has been studied extensively. An 
overview is provided in the book by Lothaire [10] . Chapter 6 ( "Statistics on Words with 
Applications to Biological Sequences"), which is particularly interesting in this context, 
is based on the overview article by Reinert et al. [59J. 

In many approaches, a generating function is derived for the sought quantity. Then, 
typically using symbolic Taylor expansion, the concrete values can be computed. This 
procedure is, for instance, described by Regnier [SB], who gives formulas for mean, 
variance and higher statistical moments of the exact occurrence count distribution. This 
approach has the advantage of additionally allowing asymptotic analysis. Her framework 
is general enough to admit Markovian sources as well as finite sets of patterns to be 
treated in the overlapping as well as in the non-overlapping case. Closely related is 
the approach of Nicodeme et al. [53j, who present an algorithmic chain to compute 
the distribution of the match position count for regular expressions. Lladser et al. [39] 
recently reviewed the field. Their main concern is to bring together involved concepts 
in a consistent and rigorous manner. They make the connection to the classical field 
of automata theory and pattern matching explicit and therefore speak of probabilistic 
pattern matching. Furthermore, they describe the relation between finite automata and 
Markov chains in terms of the Markov chain embedding technique. Related approaches 
to compute the exact p- values based on automata are developed in [9j [55]. Another 
dynamic programming approach was presented in [71]. It is used to compute exact p- 
values for position weight matrices describing transcription factor binding sites (TFBS). 

In this section, we provide a unifying framework for the efficient computation of pat- 
tern matching statistics. In contrast to existing approaches, overlaps are handled cor- 
rectly and arbitrary finite-memory text models (including HMMs) can be used. 

6.2 Constructing DAAs from DFAs 

As usual, we define a deterministic finite automaton (DFA) to be a tuple (Q, S, 5, q , J 7 ), 
where Q is a finite state space, E is a finite alphabet, 5 : Q x H — > Q is a transition 
function, q is a start state, and T C Q is a set of accepting states. Again, we extend S 
to X* in its second argument, as for DAAs in Definition 15.31 Suppose a pattern is given 
in the form of a DFA, that means, a string is an instance of the pattern if it is accepted 
by the DFA. We seek to calculate the distribution of the number of occurrences of this 
pattern. 

To use a DFA for pattern matching, that is, to find all instances of a pattern in a (long) 
text, we construct an automaton that accepts not only all strings matching the pattern, 
but all strings that have a suffix matching the pattern (see [52]). For a pattern given in 
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the form of a DFA or non-deterministic finite automaton (NFA), it is always possible to 
construct a DFA for this task as follows. First, we add a self-transition labeled with the 
whole alphabet E to the automaton's start state to ensure that the start state remains 
active all the time. The result is an NFA, which can be made deterministic again by 
employing the classical subset construction (explained for example in [52] ) . As we see in 
Sections 16.2. II to f6.2.2|. there are often simpler and more direct ways to build the sought 



When a DFA reads a text, it is in an accepting state whenever (at least) one instance 
of the pattern ends. The number of these events equals the match position count as 
defined above. To count the number of times a DFA (Q, E, 5, go, F) is in an accepting 
state, we define emissions 



We call a tuple (Q, E, 5, go, ( r ]q)qeo) counting DFA. 

To compute the distribution of the overlapping count instead of the match position 
count, we have to take into account that more than one match can end at a position in 
the text. For each state, we therefore need to know how many matches end when it is 
entered. As we will see, this is never a problem when the pattern represents a finite set 
of string^!. Formally, we define emissions rj = (rj q ) qe Q, where r\ q G DMo gives the number 
of matches to be counted upon entering state q. 

To obtain the nonoverlapping count, the automaton can be modified accordingly: We 
change the outgoing transitions of each accepting state of the match position count 
automaton to act as if they originate from the start state go- Therefore, we define a 
modified transition function 5' by S'(q,a) := S(q,a) if g ^ F, and 8'(q,a) := 8(q ,a) 
otherwise. 

As described above, all three counting schemes can be realized by counting DFAs when 
the pattern is a finite set of strings. The main idea now is to construct a DAA from 
this counting DFA by adding the emissions generated in each state, and turn this into a 
PAA using Lemma 15.41 For practical computations, it is often sufficient to truncate the 
values at a constant M £ N. (Note that the match position count is always bounded 
by the length of the processed text, so $ n = 0(n).) The proof of the following theorem 
contains the details of the DFA— )-DAA— >-PAA construction. 

Theorem 6.1. Let a counting DFA D = (Q, E, S, go, (Vq)qeQ) ana " a ^xt model (C, Co, E, ip) 

be given, and let (St)ten be a random text distributed according to that model. Then, 
the (truncated) distribution of accumulated counts 



2 If the pattern represents an infinite set, such as a regular expression containing a star operator, there 
may be states in J- that can be entered when i or j matches end for i ^ j, or for which the number 
of overlapping matches to be counted might be unbounded. In this case, one is essentially restricted 
to the match position count. Since we only consider finite sets, this is not an issue here. 



DFA. 




1 ifqeF, 
otherwise. 




(20) 
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can be computed in 0(n ■ \ Q\ ■ \C\ 2 ■ |E| ■ M) time and 0(\Q\ ■ \C\ ■ M) space. If for all 
c G C and o G E, i/iere exzsfo at most one d G C such that <p(c, a, d) > 0, then the 
runtime is bounded by 0(n ■ \ Q\ ■ \C\ ■ |E| • M). Alternatively, (T2T)j) can be computed in 
0(\ogn- \Q\ 3 ■ \C\ 3 ■ M 2 ) time and 0(\Q\ 2 ■ \C\ 2 ■ M) space. 

Proof. Given the counting DFA (Q, E, 8, q , (T) q ) q eQ), we use ^ to construct the DAA 
V = (Q,qo,^,S,V,v ,E,(T] q ) qe Q,(e q ) qe Q), where V := {0,...,M}, v := 0, £ := {iq q : 
q G <2}, and all operations are truncated additions: 



' M if r ■ ( > M . 
v + e otherwise 



9 (t>,e) := 

for all g G Q. For this DAA P, we have 

{n-l 
i=0 

for all s G E*. 

We apply Lemma 15.41 to this DAA and text model in order to construct the PAA. 
The runtime and space bounds for the basic algorithm follow directly. To obtain the 
bounds for the alternative doubling algorithm, namely 0(logn ■ \ Q\ 3 ■ \C\ 3 ■ M 2 ) time 
and 0{\ Q\ 2 ■ \C\ 2 ■ M) space, we exploit that the operations 9 q are (almost) additions in 
this case. Thus, U®(q\, q2, i>i, v 2 ) = U^\qi,q2,V3,V4) if v i < t>2 < M, t> 3 < t> 4 < M and 
V2 — v\ = t>4 — t>3. Thus, we can fix v i = and thereby save a factor of |V| = M + 1 in 
time and space. The special cases for v 2 = M or v 4 = M can be accommodated in the 
same bounds. □ 

Before turning the DFA into a PAA, one may wish to minimize it. Using an algorithm 
by Hopcroft [23], a classical DFA can be minimized in 0(\ Q\ log \ Q\) time for an alphabet 
of constant size, where Q is the set of states. Refer to Knuutila [32] for a tutorial-like 
introduction and a variant that runs in 0(|E| ■ \ Q\ log \ Q\) time when the alphabet size 
is not considered to be a constant. Hopcroft's algorithm can be adapted to minimize 
counting DFAs by using the partition induced by the different emissions as an initial 
partition, i.e. states with the same emission are grouped together. 

In the following, we review some concrete pattern classes important in practice, par- 
ticularly in computational biology. 



6.2.1 Finite Sets of Strings 

Assume that the pattern is given in the form of a finite set of strings. In this situation, 
an Aho-Corasick automaton [TJ, which essentially is a DFA, can be built. It can be 
constructed in linear time by either using the algorithm given in the original paper or by 
employing a recent elegant algorithm based on the suffix tree of the reverse strings [20J. 
The emissions rj q (number of matches) can directly be read off the Aho-Corasick au- 
tomaton's output function for all states q. 
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(a) PAA for overlapping hits 



(b) PAA for non-overlapping hits 



Figure 2: PAAs for the distribution of matches of the pattern set {101, 111}, assuming 
an i.i.d. text model over alphabet S = {0, 1} with probability p for character 
1, and q := 1 — p. The start state is denoted e. Each state is associated with 
the operation "+" and a Dirac emission distribution, shown in the gray boxes. 



Figure |5] shows a PAA computing the distribution of the number of (a) overlapping 
matches and (b) non-overlapping matches of the pattern set {101, 111} in an i.i.d. text 
model over the alphabet {0, 1}, where the probability of seeing 1 is p. 

6.2.2 Finite Sets of Generalized Strings 

Generalized strings are finite sequences of sets of characters over an alphabet S, for ex- 
ample [abc] [ac] [ab] (which matches aaa, ccb but not aba). Rather than enumerating 
all strings matching a generalized string, we can directly construct an NFA that recog- 
nizes all strings ending with an instance of it. The NFA corresponding to one generalized 
string is just a linear chain of states; a start state plus one state for each position, where 
the start state is additionally equipped with a self-transition. The NFA for the set of 
generalized strings can be constructed by merging all individual start states into one 
common start state. The next step towards a PAA is to build a DFA. To obtain a DFA, 
we employ the classical subset construction. Although, in the worst case, it results in an 
exponential increase in the number of states, this method is feasible in many practical 
cases. In fact, the construction procedure can be modified such that it always results in 
the minimal DFA [36]. By the subset construction, a set B q of NFA states corresponds 
to each DFA state q. The number of final NFA states in B q equals the number of match- 
ing generalized strings that end when DFA state q is entered, giving us the number of 
matches r\ q to be emitted by q. 

Prosite Patterns 

Prosite is a database of biologically meaningful amino acid motifs (see Hulo et al. [26]). 
Prosite patterns can be seen as generalized strings with the extension that, for each po- 
sition, a "multiplicity range" can be specified. In the pattern A-x(2,3)-C, for example, 
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an A is followed by either two or three arbitrary characters followed by a C. We trans- 
late every Prosite pattern into a set of generalized strings. The above example would 
result in the two patterns A-x-x-C and A-x-x-x-C. This set can then be dealt with as 
explained above. 

We implemented the algorithms in Java and ran them on a Intel Core 2 Duo 2.66GHz, 
4GB RAM, running Linux to assess practicability. Release 20.17 of Prosite contains 
1319 patterns, 16 of which refer to the start or ending of a sequence. Those entries were 
ignored, leaving a database of 1303 patterns. For 42 patterns (3.2%) the computation 
did not succeed due to memory limitations. This can happen if either the Prosite 
pattern translates into too many generalized strings or if the DFA resulting from the 
subset construction grows too large. For 1236 of the 1261 remaining patterns, the subset 
construction was completed within 2 seconds while the computation took 69.9 seconds for 
the "worst pattern" . The resulting automata were minimized using Hopcroft's algorithm. 
Many automata, however, already were minimal or close to minimal; for 1209 automata 
the minimized automaton was larger than half the size of the original automaton. The 
majority of resulting minimal automata were of reasonable size: We obtained 1198 
automata with less than 10000 states, among which 1036 had less than 500 states. 

To give an impression of the runtimes to be expected when computing the distribution 
of the overlapping occurrence count, consider the pattern 

C-x-H-R- [GAR] -x (7,8)- [GEKVI] - [NERAQ] -x(4 , 5) -C-x- [FY] -H 

from the Prosite database. It results in an automaton with 462 states. Assuming M = 50 
(maximum number of occurrences of interest) and n = 1000 (text length), computing 
the distribution of the occurrence count took 1 second. 

6.3 Waiting Time for Pattern Occurrences 

The waiting time for the first occurrence of a pattern equals the waiting time for a state 
that emits a match. Therefore, its distribution can directly be computed by applying 
Lemma |4~41 The waiting time for a subsequent occurrence can be computed by choosing 
a in Lemma 14.41 to be the equilibrium distribution restricted to all match states, i.e. 
those states that emit a match. 

6.4 Clump Size Distribution 

We already saw how to use a DFA recognizing a given pattern to construct a PA A. 
Through this method, we could accurately account for possible self-overlaps of patterns. 
The structure of self-overlaps was implicitly encoded in the DFA. In this section, we 
explicitly work out a pattern's tendency to overlap itself by computing its clump size 
distribution As detailed in [48], the exact clump size distribution of a pattern is, 
besides its theoretical value, useful for the construction of compound Poisson approxi- 
mations. Compound Poisson approximations have also been discussed in [6T ] [TO] , 160]. 

Definition 6.2. Given a sequence s 6 X* and a pattern p, a clump is a maximal set of 
overlapping occurrences of p in s. 
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V dl Utib . -v v vfV v^" v°3" v°3" v°3" v°3" 



ACACACACTTAGGA 



Clump start position 
Clump end detected 



Figure 3: A clump (shaded gray) of three occurrences of the pattern ACAC. By definition, 
the clump starts at the last character of the first occurrence. The values 
computed by the clump size DAA are shown above the string. As soon as the 
second counter reaches m — 1 = 3, the clump has ended. 



For example, let p := ACA and s := G ACACA TT ACA AA. Then s contains three occurrences 
of p in two clumps (underlined). By definition, a clump consists of at least one match. 
We call the position of match's last character match position and consider the first match 
position in a clump. Further, we call the distribution of PAA states at such positions 
clump start distribution and denote it 7; i.e. given that j is the first match position 
in a clump, then P(Qj = q) =: 7(5), which is asymptotically independent of j under 
certain assumptions. For now, we assume 7 to be known and come back to the task of 
its calculation later. 

If m > 2 is the length of the given motif, then a clump ends if m — 1 consecutively 
visited states do not emit a match. That means we need to keep track of (a) the number 
of non-match states consecutively visited and (b) the number of matches the clump 
contains so far. The PAA framework allows this by modifying the PAAs described in 
the previous subsections. We define a new value set V := DMo x {0, . . . , m — 1, •} with 
the start value v' Q := (0,0) and attach the following semantics: If we are in state q and 
the current value is (h,x), we have seen h matches in the current clump and the last 
of these matches occurred x steps in the past; i.e. if x — 0, a match has been emitted 
from the current state. The special value x — • indicates that the clump has ended. We 
define the operations accordingly: 

{(h + e,0) if e > and x G {0, . . . , m - 2} , 
(h, x + 1) if e = and x E {0, . . . , m - 2} , 
(h, •) if x G {m — 1, •} . 

In other words, if a match has been found (e > 0), we increase the number of matches h 
by e and reset the distance to the last match to 0. Otherwise (e = 0, no match occurred), 
h remains unmodified, but the number of steps x since the last match is increased by one. 
See Figure [3] for an example. To incorporate the clump start distribution 7, we use one 
additional state q' that becomes the new start state; consequently, we set Q' := {q' }U Q 



22 



and define the new transition function to be 

T':(g,g')^{ 7(g/) . ifg = g °' (21) 
I T(q, q') otherwise . 

The set V is infinite. As discussed in Section El this does not pose a problem as 
the range of each V t is finite. Furthermore, for many applications it is sufficient to 
truncate the clump size distribution and use the value set V" := {1, . . . , M} x {0, . . . , m— 
1, •} along with adapted operations 9" Employing one of the algorithms shown in 
Sections f3 . 1 1 and l3~2| respectively, we can then calculate the joint state- value distributions 
p t (q, h, x) := P(Qt = q, V t = (h, x)). A clump ends if no new match has occurred m — 1 
steps after the previous match. The clump size distribution \1/ is thus given by 

oo 

t=0 q&Q 

To actually compute we start with the initial table po an d iteratively calculate the 
tables p t for larger t. Each p t contributes to the sought distribution through the inner 
sum from Equation (j22|) and we can successively add the contributions to an intermediate 
clump size distribution. Observe that the difference between the intermediate clump size 
distribution after iteration t and the exact one is bounded by 

M 

q&Q h=0 

Thus, we iterate until this quantity drops under an accuracy threshold. The number 
of necessary steps, however, is bounded by OiM ■ m), because a clump containing M 
matches can have a length of at most O(M-m). In total, we need \C\- \Q\ -M 2 -m?) 

time to compute the exact clump size distribution. Again, a factor of \C\ can be saved 
if for all c G C and a e E, there exists at most one c' G C such that (p(c, a, c') > 0. 

State Distribution at Clump Start 

Let us come back to computing the clump start distribution 7 needed in Equation (|2~TT) . 
As discussed in Section^ the PAA's state process (Qt)teu is a Markov chain and, hence, 
the classical theorems about existence of and convergence to an equilibrium distribution 
apply: Irreducibility and aperiodicity are sufficient for convergence to a unique equilib- 
rium distribution. Assuming (a) that a pattern does not start with a wildcard and (b) 
all text model states c G C have positive occurrence probability, these conditions can be 
verified to be fulfilled by construction of the PAA. 

We consider the joint distribution of state and steps since the last match position. 
Recall that E t is the emission process of a PAA. We define L t as the number of steps 
since we last encountered a match before step t. Thus 

L t :=mm{t' G {!,..., t} \ E t _ t , > 0}, 
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where we set L t := — oo if the set is empty, meaning that no match has occurred until 
step t — 1. Again we use the PAA framework to compute the joint state- value distribution 
C(Q t ,L t ) for any desired t. The clump start distribution is now given by 

7 (g) = \imP(Q t = q\L t >m,E t >0) . (23) 
In practice, the limits for t — > oo exist and converge in a few steps to double precision. 

7 Analysis of Window- Based Pattern Matching 
Algorithms 

The basic pattern matching problem is to find all occurrences of a pattern string in a 
(long) text string as fast as possible. Let n be the text length and m be the pattern length. 
The well-known Knuth-Morris-Pratt algorithm [31] reads each text character exactly 
once from left to right after preprocessing the pattern and needs a total of <d(n + m) 
character accesses. In contrast, the Boyer-Moore [10], Horspool [25] and Sunday [67] 
algorithms move a length-m search window across the text and first compare its last 
character to the last character of the pattern. This often allows to move the search 
window by more than one position (at best, by m positions if the last window character 
does not occur in the pattern at all), for a best case of 0(m + n/m) but a worst case 
of 0(m + mn) character accesses. The worst case can be improved to 0(m + n), but 
this makes the code more complicated and is seldom useful in practice. The Horspool 
algorithm and the variant of Sunday can be seen as modifications of the Boyer-Moore 
algorithm that are simpler to implement and additionally perform better in practice [52J . 
In general, a window-based algorithm that searches for a pattern p is characterized by 

• a window size z, 

• a cost function £^ : S 2 — > H giving the cost caused by a window, 

• a shift function shift p : H z — > {1, . . . , m} giving the number of positions the window 
can safely be shifted by. 

The total cost of processing a text s G S n is denoted £ p (s). In this section, we develop 
a methodology to calculate the exact distribution of £ p (So • • • Sn-i) when 5* is a random 
text and pattern p is fixed. This question has so far not been investigated, even though 
related questions have been answered in the literature. For example, [3 E] analyze the 
expected number of character accesses for Horspool's algorithm. In [12] it is shown that 
the number of character accesses is asymptotically normally distributed for i.i.d. texts, 
and [65J extends this result to Markovian text models. 

The technique we introduce here allows to compute the exact distribution of the total 
cost for arbitrary cost functions, general finite-memory text models as defined in Sec- 
tion |5l and applies to all window-based pattern matching algorithms. For concreteness, 
we consider Horspool's algorithm as given in Algorithm [21 that means, pattern and win- 
dow are compared from right to left and the shift solely depends on the window's last 
character. Formally, Horspool's algorithm is characterized by 
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Algorithm 2 Horspool 



Input: text s G £*, pattern p G E m 

Output: pair (number occ of occurrences of p in s, number cost of accesses to s) 
1: pre-compute table shift[a] for all <r G £ 
2: (occ, cost) <- (0, 0) 
3: t <- m-1 
4: while t < |s| do 

5: Z «- 

6: while i < m do 

7: cost cost + 1 

8: if s[t — z] 7^ p[(m — 1) — i] then break 

9: i <- i + 1 

10: end while 

11: if % = m then occ occ + 1 

12: t <- t+ s/w/it[s[t]] 

13: end while 

14: return (occ, cost) 



• z :— m (the window size equals the pattern length), 

. g>/ w \ . = \ m ifp = w, 

mm {i ■ 1 < i < m, p[m — i] 7^ w[m — i]} otherwise, 

• the shift depends on the position of the rightmost occurrence of w[m — 1] in p: 

right p (w) := max [{i G {0, . . . , m — 2} : = u;[m — 1]} U { — 1}] , 
shift p (w) := (m — 1) — right p (w) . 

Here, we have used the number of character accesses as a measure of cost. 

We now construct a PAA that allows to compute the cost distribution for arbitrary 
window-based pattern matching algorithms with respect to a random text of length n 
defined by a text model (C, c , S, cp). Note that we cannot construct a DAA and apply 
Lemma 15.4} as the considered algorithms can read several characters "at once" , while 
a DAA is only capable of processing one character at a time. In the defined model of 
window-based algorithms, shift and cost depend solely on the current window. Therefore, 
we model each possible window as a state (although concrete algorithms may permit 
smaller state spaces). Additionally, we need to keep track of the current text model 
state after generating the current window. We use the state space 

Q :={(£, c )} U(S 2 xC) , 

where (e, c ) =: go is the start state and any other state (w, c) corresponds to window 
content w and text model state c. Note that, if the text model is a Markovian model of 
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order r < z, then its state is fully determined by the current window. That means, only 
+ 1 different states are reachable, all others can be discarded. For arbitrary text 
models, however, more states might be reachable. The strategy to construct the PAA is 
to simulate the algorithm and count both the reached position in the text and the cost 
so far. To this end, we assume that the text length n is fixed and define the value set 

V:={0,...,n-f,.}x{0,...,n-O, 

where ^ := max{^(w) : w G S 2 }, with = m for Horspool's algorithm. Value 
v — (t, £) corresponds to the current window ending at text position t (with • indicating 
that the text has ended), having accumulated a cost of £ so far. The start value is set 
to v := (z — f , 0). 

Each state deterministically emits cost and shift of the associated window, therefore 
we set 

£:={l,...,m}x{0,...,e}, 

where an emission of (f, £') G £ indicates that the current window has caused a cost 
of £' and the window is to be shifted by t' . Formally, we define 



AV,c)((t',0) := 



f lit' = shift p (w) and f = ^(w) , 
otherwise , 



for all (w, c) G Q. Note that we could use other (non-Dirac) emission distributions 
if necessary. We might assume, for example, that a cache miss occurs with a certain 
probability and associate higher costs with this event. The operation 6 q in each state is 
essentially an addition on V with one exception: To indicate that the complete text has 
been processed, we use the special values (•,£) G V. We define: 

{{t + t', f + f ') if t ^ • and t + f < n , 
(•,£ + ift^.andt + t'>n, 
(•,£) otherwise. 

The last component to be specified is the transition function T. The shift, and therefore 
all possible target states, are determined by the current window contents, that means, 
the suffix of the current window must match the prefix of the subsequent window. We 
define an indicator function that tells whether two windows are compatible in this sense: 

I(w,w') := lw[shift p (w)..\ = w'[..z - I - shift" (w)]} . 

While I(w, w') tells whether a transition from w to w' is allowed, its probability is given 
by the text model. 

{P (c ^ c') if (w, c) = (e, c ) , 

I(w, w') ■ P [c — > d otherwise , 
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Number of text accesses Number of text accesses 

Figure 4: Exact distribution of the number of text accesses for Horspool's and Sunday's 
algorithms using a uniform i.i.d. text model over the alphabet {A, C, G, T}. Top 
vs. bottom row: pattern AAAAA vs. ACAGC. Left vs. right column: text length 
20 vs. 100. 



where 

j'-i 

P (co a °- aj - 1 ) c '/j ■= ^ JJ^(c-,ai,c- +1 ) 

c' 1 ,...,c' j _ 1 ec i=o 

is the probability that the text model (C, c , S, tp) produces the string a[0] . . . o~[j — l] G S J 
while going from state c' G C to state c'a G C in j steps. 

Theorem 7.1. Let a window-based pattern matching algorithm specified by z, £ p , and 
shift p and a text model (C,Cq, £,<£>) be given. The exact distribution of the cost of pro- 
cessing a random string of length n can be computed in 0(n 3 ■ |S| 2 ^ • \C\ ■ £ p ) time 
and 0(n 2 ■ |S| 2 • \C\ ■ £f) space. If the text model is Markovian of order r < z, then 
0(n 3 ■ \T,\ 2z ■ £ p ) time and 0(n 2 ■ |E| 2 • £ p ) space are sufficient. 

Proof. By construction, the PAA simulates the working of the specified algorithm. As- 
suming \£\ = 0(1), runtime and space bounds follow from Lemma 13.21 If the text 
model is Markovian of order r < z, then only |S| 2 + 1 different states are reachable. 
The claimed time and space bound can then be met by using a reduced state space 

q' ■= { £ } u m z . □ 
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The exponential dependency on the window length z allows practical computations 
only for short patterns. For a pattern length of 5 and text lengths 20 and 100, a 
comparison of Horspool's and Sunday's algorithms is shown in Figure HI The calculation 
took 1.8 seconds and 40.4 seconds for text length 20 and 100, respectively. The plots 
reveal the combinatorial nature of the number of text accesses (which is asymptotically 
normally distributed) for short texts. It also reveals that Sunday's algorithms needs 
more character accesses than Horspool's algorithm for the example patterns. 

8 Alignment Seed Sensitivity 

We describe an application of the PAA framework to determine the quality of seeds used 
in homology search. Homologous biosequences have developed from a common ancestor 
and usually share high sequence similarity. In homology search, a sequence database is 
searched for a query sequence in order to find potential homologs, i.e. evolutionarily re- 
lated sequences. To this end, each database sequence is compared with the query, which 
can be done by a local alignment method such as the Smith- Waterman algorithm [64J. 

8.1 Related Work and New Results 

Since exact local alignment is too slow in practice, most heuristic homology search 
algorithms are based on a two-phase filtration technique [561 El El ED] - First, candidate 
sequences are selected that share a common pattern ("seed") of matching characters 
with the query. These candidates (or "hits") are then further investigated by an exact 
method. Initially, contiguous seeds (e.g., perfectly matching DNA 11-mers in the initial 
BLAST implementation) were used. PatternHunter (PH) by Ma et al. [H] was the first 
tool to systematically advocate and investigate spaced seeds: PH looks for 18-mers with 
at least 11 matching positions distributed as 111*1**1*1**11*111, where 1 denotes 
a necessary match and * denotes a don't care position (match or mismatch). Over 
time, various seed models have been proposed in the literature, including consecutive 
seeds [56l [2] , spaced seeds [HJ [151 El US] , subset seeds [36] , vector seeds [12] , and indel 
seeds [43]. 

In the context of homology search, it is customary to model random alignments in- 
stead of random sequences. Typical models for such alignments may consist of several 
homology parameters an hence called homology models. They are described in Sec- 
tion [H2J Different seeds in a class (e.g. all seeds with 11 match positions and length 18) 
can be compared according to their sensitivity, i.e. the probability to "hit" a random 
alignment of given length from a given homology model (see Definition [83] in Section IHU1 
for a formal definition of "hit"). 

A good seed exhibits high sensitivity for alignments that model evolutionarily re- 
lated biosequences, and low sensitivity values for alignments that represent unrelated 
sequences. The latter property ensures that the seed does not detect too many random 

3 See http://www.rahmannlab.de/software for an implementation in JAVA. The experiments were 
run on an Intel Core 2 Quad CPU at 2.66GHz. 
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Table 1: Representative string A of an ungapped alignment between two sequences. 
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hits. Random hits decrease the efficiency of the filtration phase, since they are checked 
in vain for a significant alignment. An interesting finding was that the PH approach 
led to an increase in both sensitivity and filtration efficiency, compared to seeds of con- 
tiguous matches. Based on the observations in [UJ, the advantages of spaced seeds over 
consecutive seeds have been subsequently evaluated by many authors [T5J [T7J EH]- 

An extension to single seed models is the design of a multiple seed. This is a set 
of spaced seeds to be used simultaneously, such that a similarity is detected when it is 
found by (at least) one of the seeds. The idea to use a family of spaced seeds for BLAST- 
type DNA local alignment has been suggested by Ma et al. [UJ and was implemented in 
PatternHunter II [37J. It has also been applied to local protein alignment in [14]. Recent 
approaches [331 EZ1 approximate the sensitivity of multiple spaced seeds by means of 
correlation functions. Since finding optimal multiple seeds is challenging, most authors 
concentrate on the design of efficient sets of seeds, leading to higher sensitivity than 
optimal single seeds [3JJ E2J [66] . 

When searching optimal seeds, one faces the following problems to evaluate candidate 
seeds: 

Problem 8.1 (Sensitivity computation). Given a homology model, a target length n, 
and a set of seeds, what is the probability that a random alignment of length n is hit by 
the seed (at least once)? 

Problem 8.2 (Hit distribution). Given a homology model, a target length n, a set of 
seeds, and a maximal hit number K , what is the probability that a random alignment of 
length n is hit by the seed exactly k times (or at least k times), for each k = 0, . . . , K, 
when counting (a) overlapping hits, (b) non-overlapping hits? 

The second, more general question has not yet been investigated in the literature. As 
we show, the distribution is directly provided by the constructed PAA and allows the 
investigation of optimality criteria different from sensitivity alone. 

8.2 Homology Models 

We describe random alignments with known degree of similarity (e.g. a certain per cent 
identity value) by means of a homology model. A homology model generates representa- 
tive strings A over an alphabet £ indicating the status of the alignment columns. In the 
simplest and most frequently studied case only substitution mutations and ungapped 
alignments are considered [UJ [T51 [HJ [TBI El EE]; see Table [TJ That is £ = {0, 1}, 
referring to matches (1) and mismatches (0). 
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Indel seeds, designed for gapped alignments, use the alignment alphabet £ = {0, 1, 2, 3}, 
where additionally 2 denotes an insertion into the database sequence, and 3 indicates 
an insertion into the query sequence. There are various other alignment alphabets, e.g. 
the ternary alphabet representing a match or transition or transversion in DNA [51] , or 
even larger alphabets to distinguish different pairs of amino acids in the case of proteins 
|14j . A representative string is modeled as a Markov chain (S,P, p°) with a transition 
matrix P and an initial distribution p° on the alphabet S. P and p° are called homology 
parameters. 

Ungapped alignments. We model ungapped alignments by an i.i.d. homology model 
with E = {0, 1}. In this case, the transition probability P(cr, a') does not depend on a. 

(1 — p p\ 
1 P Pi a ma ^ C ^ P r °b a b'il'ity p G [0, 1], which 

quantifies the average identity of such alignments; for example, p ~ 0.3 for unrelated, 
p ~ 0.95 for closely related DNA sequences. 

Gapped alignments. For gapped alignments, we use a first-order Markov chain. This 
is appropriate since the respective homology model should prohibit the pairs '23' and '32' 
in a representative string, because a substitution is more plausible than two consecutive 
indels. For our calculations, we used the transition matrix P proposed in [13]: 
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(24) 



where po is the probability of a mismatch, p\ is the probability of a match, and p g refers 
to the probability of a gap in the alignment. In order to obtain a stochastic transition 
matrix, p* = p a + p^p a / (Po + Pi) for a G {0, 1} redistributes p g to match and mismatch 
characters. The initial distribution is given by p° = (po,Pi,Pg,Pg)- Other transition 
probabilities are possible, e.g. if alignments with affme gap costs should be modeled. 

8.3 Seed Models 

A seed 7r = 7r[0]7r[l] . . .ir[L — 1] is a string over an alphabet of "care" and "don't care" 
characters. It represents alignment regions that indicate matches at the "care" positions. 
A seed is classified (L,u) by its length L = \tc\ and its weight u, which refers to the 
number of "care" positions. 

A contiguous seed represents a region of contiguous matches, i.e. 7r[i] = 1 for < i < L. 
A spaced seed is a string over the alphabet H = {1, *}. The "care" positions are indicated 
by 1, while * refers to a match/mismatch wildcard. Reasonable seeds for the purpose 
of homology search always require 7r[0] = n[L — 1] = 1. An indel seed according to 
Mak et al. [13] is a string over the alphabet H = {1, *, ?}, where 1 and * are as above, 
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and ? stands for zero or one character from the alignment alphabet S = {0,1,2,3}. 
Two consecutive ? symbols represent any character pair except '23' or '32'. By means 
of this interpretation, the model explicitly allows for indels of variable size. For example, 
1??1 may detect indels of size 0, 1, or 2. It hence tolerates 2, 1, or match/mismatch 
positions. 

A seed can thus be converted to a generalized string (see Section I6.2.2j) over the 
alignment alphabet when we additionally allow to skip some characters (e-characters). 
In any case, a seed can be represented as a finite set of patterns over the alignment 
alphabet; this pattern set VS(n) contains all instances of the generalized string. 

Definition 8.3 (Hit). A hit of a seed it in an alignment A is an occurrenc^ of a string 
from VS(tt) in A. We call an ending position of a seed hit in A a hit position. 

Example 8.4. Consider the indel seed it = 1*111. It corresponds to the generalized 
string [l][01][l][e0123][l], and the instances are given by the pattern set 

VS(tt) = {1011, 1111, 10101, 10111, 10121, 10131, 11101, 11111, 11121, 11131}. 

. The seed hits the alignment string 1011011110 at positions 3, 6, 7, and 8, respectively. 

For a finite, non-empty set II = {iti, . . . , 7r m } of spaced seeds, also called multiple 
spaced seed, the patterns are collected in VS(TV) = U™ =1 'PiS>(7rj). A multiple seed is said 
to hit A, if at least one of its components does. 

8.4 PAAs for Seed Sensitivity and Applications 

With the seed (set) II represented as a pattern set VS(I1), and the alignment model being 
a finite memory text model (see Section IBTTj) . the problem has been reduced to computing 
the hit distribution of a finite set of patterns (Section l6.2.ip . As mentioned in Section l6T2| 
both overlapping and non-overlapping seed hits can be considered. In fact, Figure |2] in 
Section f6 . 2 . 1 1 shows the PAA for seed tt = 1*1 with pattern set VS(tt) = {111, 101} in 
an i.i.d. text model to count (a) overlapping and (b) non-overlapping hits. 

In contrast to previous work that only considers the sensitivity (probability of at 
least one hit, Problem 18. ip . the PAA framework yields the entire match distribution 
(Problem 18.21) . However, if only the sensitivity is desired, the value set can be reduced 
to V = {0, 1}, reducing requirements to (9(n|Q| 2 ) time and 0(|<2|) space. 

An example comparing a contiguous seed with the PH seed is shown in Table [2j For 
match probability p = 0.95 (an alignment of highly similar sequences), a good seed 
should achieve a high sensitvity, or low probability for zero hits; indeed, the PH seed is 
almost two orders of magnitude better than the contiguous seed (6.7- 10~ 6 vs. 4.1 ■ 10~ 4 
for k = hits). For p = 0.3 (essentially a random alignment), a good seed should achieve 
a low sensitivity. Both sensitivty values are comparable (6.75 • 10 -5 for the contiguous 
seed; 8.3 • 10 -5 for the PH seed). If we consider only candidates with at least two hits, 

4 A hit has previously been called a match or an occurrence, but here a match concerns a single position 
in an alignment, and the term "hit" seems more descriptive. 
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Table 2: Comparison of a seed requiring 11 contiguous matches in an alignment and 
the PH seed requiring at least 11 matches within 18 alignment columns at 
particular positions. The alignment model is ungapped i.i.d., with p being the 
probability of a match, and 1 — p being the probability of a mismatch. We 
consider both highly similar sequences (top, p = 0.95) and unrelated sequences 
(bottom, p = 0.3). The tables show the probabilities for exactly k overlapping 
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the PH seed retains its advantage for alignments of highly similar sequences and now 
outperforms the contiguous seed even for random alignments (probability of 3.5 • 10 -5 
for at least two matches for the contiguous seed, but only 2.2 • 10~ 7 for the PH seed). 

The PAA framework provides a unifying method for computing seed sensitivity for 
different alignment models and seed models that were so far developed in an ad-hoc 
fashion in the literature. Furthermore, it is easily extended; let us mention two examples. 

For a simple homology model with few parameters (say, the i.i.d. homology model 
for ungapped alignments with a single parameter p) , it is possible to evaluate the recur- 
rence E] symbolically, i.e., by representing the entries of the transition matrix T(q', q) and 
the probabilities of the state-value distribution ft{q,v) as polynomials in the parame- 
ters. The resulting polynomial only needs to be computed once; then one can assess the 
sensitivity of a seed under different parameter values. This was previously presented by 
Mak et al. [H], without using the PAA approach. 

The PAA framework can also be applied to design efficient sets of seeds. Similar to [SS] , 
we can successively find seeds that locally maximize the conditional sensitivity, given 
that the seeds already present in the set do not hit an alignment. Such a conditional 
sensitivity can be computed by making the existing seeds' accepting states absorbing 
without counting a match, so only instances of the current seed candidate are counted. 
Evaluating each seed candidate and picking the best one yields the seed to be added to 
the set. 

9 Protein Fragment Mass Statistics 

Peptide mass fingerprinting (PMF) is a technique for protein identification based on 
mass spectrometry (MS) and database search. Here, we present a PAA to compute the 
significance of protein identification by PMF. The protein of interest is enzymatically 
cleaved into smaller peptides, whose masses are determined by MS. Masses are measured 
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in Dalton (Da) or unified atomic mass units (u), where 1 Da = lu equals one twelfth 
of the mass of an isolated atom of carbon-12 ( 12 C) at rest and in its ground state, or 
1.66053878 ■ 10 _24 g. The set of peptide masses, the so-called peptide mass fingerprint, is 
then used to query a database of known proteins. Each database sequence is processed 
in silico to obtain a theoretical fingerprint, and experimental and theoretical fingerprints 
are subsequently compared and scored. The highest-scoring database protein is the most 
likely candidate for the unknown sample. 

In order to come up with a reasonable scoring for such a comparison, one is interested 
in the probability that a certain peptide fragment mass occurs by chance. For example, 
a mass of 445.2 Da is characteristic for the short fragment DVCK (aspartatic acid, valine, 
cysteine, lysine), which occurs in many known proteins, and is therefore not a helpful 
feature for protein identification. 

Additionally, fragment masses vary for several reasons: The constituting atoms have 
different isotope masses; about 98.9% of all carbon atoms, for example, have six neutrons 
while about 1.1% have seven neutrons and are therefore heavier. Post-translational 
modifications (additions of chemical groups to some amino acids of the protein) may 
occur. Some cleavage sites may be missed by the protease, and one may observe one 
larger mass that corresponds to the sum of two expected masses. 

Therefore, we are interested in the mass distribution of proteolytic fragments, or the 
joint length-mass distribution of such fragments, to subsequently answer questions about 
the probability that at least one fragment with a given approximate mass exists in a 
protein of given or typical length. 

An approach chosen by many PMF software packages is to avoid probability compu- 
tations and use empirical frequencies of fragments in large protein databases instead. 
Kaltenbach [2B] introduced p- value-based scores that are database- independent, but 
based on a text model that represents protein sequences as i.i.d. random strings, i.e. as 
sequences of independent random variables that take values from the alphabet of amino 
acids according to frequencies estimated from the Swissprot database [7J |8] . 

Here, we generalize this result and present a PAA based on arbitrary finite-memory 
text models. First, we construct a DAA encoding the enzymatic cleavage reaction. 
Second, from this DAA and a model for random sequences of amino acids, a PAA is 
constructed by means of Lemma 15.41 The resulting PAA provides statistics of peptides 
usually measured in an MS experiment, in particular, the length distribution and the 
joint length-mass distribution of such fragments. Furthermore, we compute the proba- 
bility that the peptide mass fingerprint of a random protein contains at least one peptide 
of a certain mass. This, in turn, provides a significance value for PMF identifications 
and can be used to derive a scoring scheme, as explored in [28]. We further incorpo- 
rate the influence of isotopic distributions, incomplete cleavage, and post-translational 
modifications by appropriate modifications of the PAA. 

9.1 PAA for Statistics of Proteolytic Fragments 

Cleavage enzymes cut proteins at well-defined places which are often determined by one 
or two adjacent amino acids (see [621 EB])- For instance, the most widely used cleavage 
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agent Trypsin cleaves after lysine (K) and arginine (R) unless the next amino acid is 
proline (P). Such cleavage sites can be described by Til with cleavage characters T C E 
and prohibition characters U C E, where n denotes the complement of II in the amino 
acid alphabet S = {A, . . . , Z} \ {B,J,0,U,X,Z}. 

Due to their distinct structure, we distinguish the first fragment F\ from following 
fragments F + . While the first fragment may start with a prohibition character, following 
fragments do not. For i.i.d. text models, statistics for all following fragments are i.i.d. 
as well (assuming an infinite string length), see [69, 28J. For arbitrary finite- memory 
text models, however, statistics of fragments F 2 , F 3 , . . . may differ as well. In any case, 
statistics for fragments Fk converge as k grows to infinity. We write F+ to denote an 
arbitrary fragment (either first or following) . The last fragment does not necessarily end 
with a cleavage character because of the finiteness of protein sequences. 

First, we construct a DAA V = (<2, q , E, 5, V, v , £, (f] q ) q( zQ, (6 q ) q& o) that sums up 
amino acid masses (for now, we ignore isotopes) until a cleavage point is encountered. 
We set Q := E U {go, •, °}, where q is the start state. For each state q G E, the emission 
r) q gives the mass of amino acid q; the set of possible values is V := R + , where vq := 
is the start value. Note that, despite the infinity of V, the number of values reachable 
in n steps (# n ) is finite for all n (see Section E]). For states q' G {qo,»,o} no mass 
is emitted, that means, r] q i = 0. To sum amino acid masses, we define operations by 
9 q : (v, e) h->- v + e for all q G Q. The transition function 5 : Q x E — > Q reflects the 
cleavage rules: 

'a if qe (E\r)U{g }, 
a if q G T, a G II , 

• if q G r,a ^ n, 
k o iiqe{;o}. 

As can be easily verified, this DAA has the property that 5(qo, s[..i + 1]) = • if and only 
if the first fragment in s has length i. Then, valuex>(s[..i + 1]) = m, where m is the mass 
of the first fragment. 

From the DAA and a text model, a PAA is obtained by using Lemma 15.41 To take 
the isotopic mass distribution of each amino acid into account, each state's emission 
distribution can be changed accordingly. For practical computations, it is sufficient to 
perform calculations up to a limited mass accuracy as the accuracy of MS instruments is 
limited as well. One could, for example, consider masses up to one decimal position. A 
sketch of a PAA for first fragments and a first-order Markovian text model is shown in 
Figure [5] (including isotopic distributions). From now on, we assume that masses have 
been scaled and then rounded to integers to achieve the desired precision. 

The difference between the first fragment F\ and the following fragments i^i^, ■ ■ ■ 
lies solely in the initial state distribution. The transition probabilities from the start 
state to other states need to be adjusted in such a way that (1) prohibition characters 
are forbidden at the beginning of a following fragment and (2) the transitions reflect the 
distribution of text model states at the end of the previous fragment. This distribution 
can be obtained from the PAA for the previous fragment. Note again that statistics for 
all fragments Fk with k > 1 are i.i.d. if the underlying text model is i.i.d. as well. 
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Figure 5: Sketch of the PAA measuring the mass of the first fragment resulting from 
tryptic cleavage of a random protein. Following fragments are handled by a 
modified PAA where the initial distribution is normalized by 1 — pp and start 
in P is prohibited since following fragments cannot start with a prohibition 
character. For the sake of simplicity, not all transitions are shown. The states' 
weight distributions correspond to the respective isotopic distributions. Here, 
integer masses are displayed. The operation associated to each state is "+" . 

9.2 Mass and Length Distributions 

By means of the constructed PAAs, we compute fragment statistics as the length distri- 
bution and the joint length-mass distribution. We denote the length of a fragment by 
L(F±) and the (integer) mass of a fragment by M(F+). The length of a random frag- 
ment generated by the PAA corresponds to the waiting time for reaching state • minus 
one (as the last processed character does not belong to the fragment). When only the 
distribution of fragment lengths (and not the mass distribution) is required, we can use 
Lemma 14.41 to compute the distribution of this waiting time. 

Generally, the joint state- value distribution f n (q, v) = P(Q n — q,V n — v) discussed in 
Section |3] yields the joint length-mass distribution of F*. 

i/*(n,m) := P(L(F*) = n, M(F*) = m) = / n+1 (.,m) = P(Q n+1 = ; V n+1 = m) . 

The computation of f n+ i takes 0(n ■ \Q\ 2 ■ $ n • time and 0(|Q| • space (see 
Lemma [3.2p . Since both \ Q\ and \S\ are constants, and the range of possible masses d n 
grows linearly with both n and the mass scaling factor (or mass precision) A, we need 
0(\n 2 ) time and O(Xn) space. 

9.3 Mass Occurrence Probabilities 

For the interpretation of mass spectra, the most important quantity is the mass oc- 
currence probability of a measured mass m, i.e. the probability that the fragmentation 
of a random protein sequence contains at least one fragment of mass m. To account 
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for possibly inaccurate measurements, it is advisable to compute the probability that 
a fragment in a certain mass range [m — A, m + A] occurs. We modify the PAA such 
that it does not move into the absorbing state o once the first fragment has ended. We 
therefore remove state o and replace • by |E| states named •„ for each a G E. That 
means, Q := {gojUEUjwo- : o G £}. Being in state * a means that a fragment has ended 
and the first character of the next fragment is a. Therefore, the emission distribution 
of 9 a is the same as for state a. In terms of the transition function S, each state •„ acts 
like state a. Formally, 



5(q,a) 



ifgG(E\r)U{g }, 

if q e r,cr g n, 
if q g r,a i n, 

if q = m a , . 



To keep track of whether a fragment in the given mass range has already been observed, 
we introduce an absorbing value o; that means, all operations are adapted such that once 
the value o has been attained, it is not changed. The new states » a get the following 
operations: 

o if v G [m — A, m + A] , 
otherwise . 



The probability that a protein sequence of length n contains a segment in the mass range 
[m — A, m + A] is given by P(V n — o) + P(V n G [m — A, m + A]), where the second 
summand accounts for the probability that the last fragment has the sought mass. 



9.4 Missed Cleavages and Post-Translational Modifications 

Protein identification by MS is complicated by the occurrence of partial enzymatic cleav- 
age which results in peptides with internal missed cleavage sites. Even for Trypsin, 
which is reported to have a high cleavage specificity, incomplete digestion is not uncom- 
mon [631 EI] ■ The presented PAA can be modified to account for missed cleavages by 
adjusting the transition probabilities outgoing from the cleavage characters: The prob- 
ability to end a fragment after reading a cleavage character is reduced, while transitions 
from a cleavage character to non-prohibition characters occur with a small probability. 
Moreover, in a Markovian text model, one can include information about missed cleavage 
patterns, i.e. the probabilities can be weighted according to the amino acid propensities 
in proximity to missed cleavage sites. 

Another complicating issue is that many proteins are modified after translation. A 
post-translational modification (PTM) is a chemical process that changes the properties 
of a protein. It includes the addition of functional groups such as acetate or phosphate, 
the modification of amino acids, or structural changes such as the formation of disulfide 
bridges. Putative modifications can themselves be discovered by means of MS, com- 
paring the mass measured to the mass expected for the identified protein or peptide 
[U H3 [68]. PTMs result in a change in the molecular mass of the protein. Examples 
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of common modifications include phosphorylation (+80 Da), acetylation (+42 Da), and 
methylation (+14 Da or +28 Da). An overview of important PTMs along with the re- 
sulting mass shifts in Da is provided in the review of Mann and Jensen [IS]. The PAA 
presented above can be modified to incorporate PTMs as follows: "Global" modifica- 
tions of the protein are incorporated into the mass distribution of either the start state 
go or of an end state o or •. Amino acid-specific modifications are incorporated by mod- 
ifying the mass distribution of that particular amino acid state, possibly splitting the 
state into several copies accounting for different modification probabilities. Reasonable 
frequencies of PTMs associated with particular amino acids are provided in a study of 
Tsur et al. [68J. 

9.5 Characteristic Masses of Protein Families 

PAAs provide the possibility to combine any probabilistic protein model with a cleavage 
scheme and, hence, to obtain a probabilistic PMF for the model. This allows to compute 
statistics of "random" peptide fragments, as shown above, but also to compute charac- 
teristic masses of protein families, which are often represented by probabilistic models, 
such as HMMs in the Pfam database [22]. As character-emitting HMMs are special cases 
of finite-memory text models (see Section EU]) , we can combine a Pfam model with a 
DAA to compute fragment statistics specific to the protein family. 

Mass m is said to be specific for a protein family (for a particular cleavage scheme, 
with respect to a set of protein families) if the occurrence probability of fragments with 
mass m is greater than 1 — e for some small e > and if fragments of mass m are not 
expected in any other protein family. We are not aware of any work done in this area; 
hence it is unkown if family-specific fragment masses exist for any cleavage scheme. 

10 High-Throughput Sequencing 

With the 454 sequencing technology, up to a million DNA reads of lengths up to 400 
nucleotides can be determined in parallel by pyrosequencing. Nucleotides are successively 
synthesized to single-stranded DNA templates evoking an enzymatic cascade, resulting 
in detectable flashes of light. These signals are reported in a so-called pyrogram from 
which the sequence of nucleotides (the read) is determined. In 454 sequencing, the 
four different nucleotides are sequentially flowed over the reaction plate which holds the 
DNA templates. The particular order of nucleotide flows is called dispensation order 
and repeated for several (say, 100) cycles. The 454 systems GS20 and GS-FLX use 
the standard dispensation order TACG. Whenever the next nucleotide in the template 
sequence matches the currently dispensed nucleotide (in reality, its complement, but this 
is a technical detail), the dispensed nucleotide extends the complementary strand of the 
template. In case of a homopolymer run (the same type of nucleotide appearing several 
times consecutively), the strand is extended by the corresponding number of nucleotides 
at once. Such an event is detected by a correspondingly more intense flash of light. 
The more recent IonTorrent technology has the same characteristics, but uses a different 
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Table 3: Depending on the dispensation order, different numbers of nucleotides can be 
recognized during three cycles. The dispensation orders TACG and GTCA are 
compared for the sequence GTCGTATCCC. Note the homopolymer run of three 
Cs, which is sequenced with a single flow. 
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underlying technology (change in pH instead of light detection). Table [3J shows how the 
sequence of a template fragment is determined with two different dispensation orders. 

We investigate the exact length distribution of 454 sequencing reads, assuming a fixed 
number of nucleotide flows, a given dispensation order, and an arbitrary finite-memory 
text model. Previously, Rahmann [57] studied the combinatorics of sequences that can 
be reliably sequenced by the 454 technology. Kong [54] obtained generating functions 
for the length distribution, but the dispensation order was restricted to a permutation 
of the nucleotides, and the tex model was restricted to an i.i.d. model. 

Here we design a PAA that allows for arbitrary finite-memory text models and arbi- 
trary dispensation orders: Every ordering of nucleotides, where each nucleotide occurs at 
least once and no nucleotide is flowed twice consecutively, is reasonable, e.g., TCGACG. 

10.1 PAA for the Length Distribution of 454 Reads 

Instead of attacking the length distribution problem directly, we first construct a DAA 
that computes the number of nucleotide flows needed to read (a finite prefix of) an input 
sequence with a given dispensation order d = d[0] . . . d[£ — 1]. In other words, in each 
step, the DAA reads one nucleotide to be sequenced, and the emitted value models the 
waiting time (number of flows) since the previous sequenced nucleotide. We cast the 
length distribution problem as a waiting time problem by waiting for a fixed number / 
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of flows (Section HI waiting for a value, Lemma [4.2(1 . As a consequence, we obtain the 
exact length distribution £(L) = C(L(f, d)) of reads sequenced after / nucleotide flows 
according to a dispensation order d. In practice, / is usually a multiple of t — \d\. 

We define a DAA with state set Q s := (Sx{0,...,f - 1}) U {(e, *)} with start 

state go := (e, — 1) and attach the following semantics: Being in state (a, j) means that 
the last two sequenced nucleotides were a and d[j]. 

The (deterministic) emission of state (er, j) tells us how many flows it took to reach 
nucleotide d[j] after a: If d[j] = a, they are part of the same homopolymer run and the 
emission is zero; otherwise, it is the smallest forward distance between these nucleotides 
in the (cyclically interpreted) dispensation order. For example, if d = TCGACG, a = 
G, and j = 1, so d[j] = C, then the emission is 2 (skipping over d[0] = T). Formally, 

rjfaj) := min {i E {0, ...,£ — 1} | d[j — i mod £] = a}. (25) 

The transition target from state (a, j) when reading character a' is consequently 

5((a,j),a') := (d\j] , j + min {i E {0, . . . , i - 1} | d[j + i mod £] = a'} mod t\\ 

this is also valid for a = e. Intuituvely, the "old" j becomes the "new" a = d[j], and 
the "new" j is obtained by cyclically searching forward in the dispensation order. The 
initial transition targets and emissions are given by 

<y((e,-l),<7') := (e, min {i e {0, ...,£- 1} | d\i] = a'}); 
ri {e>j) :=] + ! for j E {-1, ...,£- 1}. 

The sequencing protocol specifies the number / of nucleotide flows; typically / = 400 
for the standard dispensation order TACG, i.e. 100 cycles. The operation in each state 
adds the emitted number of flows, truncating at / + 1. This completes the construction 
of a DAA. 

We are interested in the sequence length distribution up to a given length n. We 
define the target set T := {/ + 1} (cf. Definition 14. ip and obtain the read length as one 
less than the waiting time 

W T = min{t G N | V t = f + 1} . 

Invoking Lemma 15.41 yields the corresponding PAA for a general finite-memory text 
model, from which we obtain the read length distribution up to a given n. Applying 
Lemma l5T6T s statement about waiting times with E = 0(1), \C\ = 0(1), \Q V \ = 0(£) 
and dn = 0(£n) results in a running time of 0(n 2 £ 2 ) and a space requirement of 0(n£ 2 ). 

As an application, we compared the expected read lengths under an estimated first- 
order Markov model for all reasonable dispensation orders of length 4 to 10 on two 
datasets of 454 reads of yet unfinished strains of the GC-rich bacteria S. meliloti and 
R. Lupinii, provided by the Genetics Department of Bielefeld University. We observed 
that the standard machine settings could be improved to yield approximately 10% longer 
reads on average, namely 282 nt instead of 257 nt with the GS-FLX instrument during 
/ = 400 nucleotide flows. 
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11 Discussion 



We have presented the concept of probabilistic arithmetic automata that blends well 
into the landscape of existing stochastic models like Markov chains and hidden Markov 
models. In fact, PAAs can be seen as Markov chains over a larger state space. The 
benefit of PAAs lies in their utility as a modelling technique, specifically (1) the required 
state space is often small, (2) the connection between states and values becomes evident, 
and (3) an elegant notation and well-arranged recurrences are obtained. As shown in 
the second part of this article, many applications can conveniently be approached using 
the PAA framework. The method is especially suited for applications involving the 
deterministic processing of random texts. In these PAA can be constructed 

from a deterministic arithmetic automaton and a finite-memory text model, a quite 
general class of text models that covers simple i.i.d. models as well as arbitrary-order 
Markov models and character-emitting HMMs. 
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